This is the document explaining how we structured our excel-data and R-code:
----------------------------------------------------------------------------

Excel:
-----

1 - MasterSpreadsheetMain:
The file contains Raw Data (time series), our Calculations, ESG and our Regressions/Results. All is colourcoded for simplicity.

2.0 SMB30 - 7.1 OutputESG_Mid and ESG_Rand (Portifolios and their output)
The files marked with "30" contain the top and bottom 30% portfolios, and the files marked "Output" contains the monthly average returns, for each year, calculated using the code v6.R.

R-Code (There is also codespesific explanations within the code text) :
-----------------------------------------------------------------------

v6.R:
This script calculates the average monthly return for a portfolio constructed from an Excel file. This file changes for each factor, and the same applies for what the code identifies "big_stocks" and "small_stocks". As the code is now we look at the "ESG Mid and Random ptfs.xlsx" (line 9), and differentiate on "Mid" and "Random" (line 11 and 12), but if we used the "SMB.xlsx" file (line 9), we would sort on "Small" and "Big" (line 11 and 12). Why and how we sort them like this will become more apparent when you look at the Excel files containing the portfolios. All the excel files containing the portfolios contains worksheets named by year ("2002" to "2022"). The code then iterates through each worksheet (sheet_names), uses process_portfolio to compute average returns for each year’s portfolio, using a worksheet containing the monthly excess returns, and stores the results in results_list. It then filters these returns to include data only for the fiscal year (July 1st to June 30th of the following year), storing them in filtered_results_list, which is subsequently printed into another Excel file named "Output(name of factor)".

SummaryStat.R:
In this script, we import all similarly formatted "Output" files and conduct basic summary statistics on them.

Regresjon ESG.R:
This R code reads data from an Excel file named "ESG.xlsx" and fills missing data using linear regression. The processed data is then outputted to a new Excel file named "ESGreg.xlsx".
First, it imports data from "ESG.xlsx" into a dataframe. Then, it defines a function to extrapolate missing data using linear regression for each column:

  1.  It filters out rows with missing data (NA values) for each column.
  2.  For columns with sufficient data (at least 2 non-NA values), it builds a linear regression model (lm(as.formula(paste(col, "~ year")), data = available_data)).
  3.  It predicts missing values for identified years using the model (predict) and updates the dataframe (df) with these predictions, hence giving us simulated ESG scores.